Extending modularity by incorporating distance functions in the null model 
Modularity is a widely used measure for evaluating community structure in networks. The definition of modularity involves a comparison of the fraction of within-community edges in the observed network and in a null model. In the original definition, the null model considers only node degrees when rewiring edges at random, and therefore fails to represent many real-world networks well. To handle this problem, we incorporate distance functions in the null model to facilitate edges between certain nodes while respecting the degree factor. This enables us to create a framework for generating appropriate modularities adapted to various networks.



A network community is a group of nodes within which edges are dense, but between which edges are sparse. In practice, optimization methods are widely used for detecting communities in networks. The basic idea is to define a quantitative measure for evaluating the "goodness" of a partition of the observed network into communities, and then to search through possible partitions for the one with the highest score. A variety of partition measures have been proposed, but the most famous one is modularity. Formally, modularity is defined to be the fraction of edges within communities in the observed network minus the expected value of that fraction in a null model, which serves as a reference and should characterize some features of the observed network. Mathematically, modularity reads

$$Q(\ell) = \frac{1}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( A_{ij} - P_{ij} \right) \delta(\ell_i, \ell_j), \tag{1}$$

where $\ell$ is a partition, represented on the right-hand side of the equation as a community-assignment vector with element $\ell_i$ indicating the community membership of the $i$th node $v_i$, $n$ is the number of nodes, $m$ is the number of edges, $A_{ij}$ is the element of the adjacency matrix $A$ representing the number of edges between $v_i$ and $v_j$ in the observed network, $P_{ij}$ is the expected value of that number in the null model, and $\delta(\cdot,\cdot)$ is the Kronecker delta.

In the original version of modularity proposed by Newman and Girvan (NG modularity), the null model preserves the degree sequence of the observed network and rewires edges randomly, giving

$$P_{ij}^{NG} = \frac{k_i k_j}{2m}, \tag{2}$$

where $k_i = \sum_{j=1}^{n} A_{ij}$ is the degree of $v_i$. Note that in this null model the number of edges between nodes depends only on their degrees. However, this is not always the case for the network under observation, because other factors such as the distance between nodes can strongly affect their connections. For example, in spatial networks such as the Internet, road networks, and flight connections, long-distance connections are restricted by financial cost or physical constraints. In social networks, individuals who are closer to each other with respect to their interests are more likely to be connected. Consequently, this null model may fail to be a valuable reference and may result in a less accurate partition measure.

Note: In this paper we focus on undirected and unweighted networks. Extensions to directed and/or weighted networks follow existing work and are straightforward.

We develop a new null model $P_{ij}^{dist}$, which takes the distance factor into consideration. In our null model, the number of edges between $v_i$ and $v_j$ is rewired according to

$$P_{ij}^{dist} = \frac{\tilde{P}_{ij} + \tilde{P}_{ji}}{2}, \tag{3}$$

$$\tilde{P}_{ij} = \frac{k_i k_j e^{-(d_{ij}/\sigma)^2}}{\sum_{t=1}^{n} k_t e^{-(d_{ti}/\sigma)^2}}, \tag{4}$$

where $\sigma \in (0, +\infty)$ is a field range parameter which will be explained later, and $d_{ij} \geq 0$ is the distance between $v_i$ and $v_j$.

Eq. (4) can be interpreted by the data field idea proposed by Li, which brings the field theories used in physics to describe interactions between particles into the data space. Now suppose each node exerts forces on the others by generating a field. The field theories (including those of the gravitational field, the electrostatic field, and the magnetostatic field) say that the potential at a point in space is directly proportional to the power of the field source (such as the mass or charge), and decreases as the distance to the source increases. Based on this, we can calculate the potential at $v_j$ of the field driven by $v_i$ as



$$\varphi_i(j) = k_i f(d_{ij}), \tag{5}$$

where $k_i$ is taken to be the power of $v_i$, and the distance function $f(d) = e^{-(d/\sigma)^2}$, which takes values in $(0, 1]$, decreases monotonically with $d$. The parameter $\sigma$ reflects the interaction range of the field: if $\sigma$ is small, $f(d)$ decreases sharply, indicating a short-range field; if $\sigma$ is large, $f(d)$ decreases slowly, indicating a long-range field. Note that the fields driven by different nodes can be superposed. Suppose the potential is a scalar quantity without direction. The superposed potential at $v_j$ is

$$\varphi_{amp}(j) = \sum_{t=1}^{n} \varphi_t(j) = \sum_{t=1}^{n} k_t f(d_{tj}). \tag{6}$$
Then Eq. (4) can be rewritten as

$$\tilde{P}_{ij} = \frac{\varphi_i(i)\,\varphi_j(i)}{\varphi_{amp}(i)}, \tag{7}$$

where we have used $d_{ii} = 0$ and $d_{ij} = d_{ji}$. That is, $\tilde{P}_{ij}$ is the product of the potentials at $v_i$ of the fields driven by $v_i$ and $v_j$ separately, divided by the superposed potential at $v_i$.
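The potential-based construction can be sketched numerically. The Python sketch below is illustrative only and rests on stated assumptions: the distance function is taken to be $f(d) = e^{-(d/\sigma)^2}$, the distance matrix is symmetric with a zero diagonal, and the final null model is the symmetrized average of $\tilde{P}_{ij}$ and $\tilde{P}_{ji}$. The function name `dist_null_model` and the toy path graph are hypothetical.

```python
import math

def dist_null_model(degrees, dist, sigma):
    """Sketch: build P_tilde and the symmetrized P_dist from node potentials.

    Assumes dist is symmetric with a zero diagonal and sigma > 0.
    """
    n = len(degrees)

    def f(d):
        # Assumed distance function: equals 1 at d = 0 and decays with d.
        return math.exp(-((d / sigma) ** 2))

    # phi[i][j]: potential at v_j of the field driven by v_i (source power k_i).
    phi = [[degrees[i] * f(dist[i][j]) for j in range(n)] for i in range(n)]
    # phi_amp[j]: superposed potential at v_j over all sources.
    phi_amp = [sum(phi[t][j] for t in range(n)) for j in range(n)]
    # P_tilde[i][j] = phi_i(i) * phi_j(i) / phi_amp(i).
    p_tilde = [[phi[i][i] * phi[j][i] / phi_amp[i] for j in range(n)]
               for i in range(n)]
    # Symmetrize so that the null model is undirected.
    p_dist = [[0.5 * (p_tilde[i][j] + p_tilde[j][i]) for j in range(n)]
              for i in range(n)]
    return p_tilde, p_dist

# Toy example: path graph v0 - v1 - v2 with hop-count distances.
degrees = [1, 2, 1]
dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
p_tilde, p_dist = dist_null_model(degrees, dist, sigma=1.0)
```

In this sketch each row of `p_tilde` sums to the corresponding degree, and the entries of `p_dist` sum to twice the number of edges.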

In the following, we describe the features of our null model. First, it can be found that

$$P_{ij}^{dist} = P_{ji}^{dist}, \tag{8}$$

which implies that edges in our null model are undirected. Second, it is easy to derive that

$$\sum_{j=1}^{n} \tilde{P}_{ij} = k_i, \tag{9}$$

and hence

$$\sum_{i=1}^{n} \sum_{j=1}^{n} P_{ij}^{dist} = 2m. \tag{10}$$

That is, the number of edges of the observed network is preserved. Third, Eq. (4) tells us that $\tilde{P}_{ij}$ is positively related to $k_i$ and negatively related to $d_{ij}$. In other words, connections tend to link to high-degree nodes and to nodes at short distances.
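The second feature follows in a few lines. The derivation below is a sketch under two assumptions: the null model has the symmetrized form $P^{dist}_{ij} = (\tilde{P}_{ij} + \tilde{P}_{ji})/2$ with $\tilde{P}_{ij} = k_i k_j f(d_{ij}) / \sum_t k_t f(d_{ti})$, and the distance is symmetric, $d_{ij} = d_{ji}$.

```latex
\begin{align*}
\sum_{j=1}^{n} \tilde{P}_{ij}
  &= k_i \, \frac{\sum_{j=1}^{n} k_j f(d_{ij})}{\sum_{t=1}^{n} k_t f(d_{ti})}
   = k_i, \\
\sum_{i=1}^{n} \sum_{j=1}^{n} P^{dist}_{ij}
  &= \frac{1}{2} \Big( \sum_{i,j} \tilde{P}_{ij} + \sum_{i,j} \tilde{P}_{ji} \Big)
   = \sum_{i=1}^{n} k_i = 2m.
\end{align*}
```

The first line uses $d_{ij} = d_{ji}$ to match numerator and denominator; the second uses the fact that both double sums run over the same set of index pairs.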

Based on our null model, we can define distance modularity (dist-modularity) as

$$Q^{dist}(\ell, \sigma) = \frac{1}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( A_{ij} - P_{ij}^{dist} \right) \delta(\ell_i, \ell_j). \tag{11}$$

From Eqs. (10) and (11), it is clear that $Q^{dist} \in [-1, 1]$, the same as NG modularity.
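As an illustration of how such a modularity is evaluated, the sketch below computes $Q$ for any null-model matrix supplied by the caller. It is a minimal example, not the authors' implementation: the function names are hypothetical, the toy graph is invented, and the NG null model $k_i k_j / 2m$ is used as a stand-in null matrix because it needs no distance information.

```python
def modularity(adj, null, labels):
    """Sketch of Q = (1/2m) * sum_ij (A_ij - P_ij) * delta(l_i, l_j)."""
    n = len(adj)
    two_m = sum(sum(row) for row in adj)  # equals 2m for an undirected graph
    total = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:  # Kronecker delta on community labels
                total += adj[i][j] - null[i][j]
    return total / two_m

def ng_null(adj):
    """NG null model P_ij = k_i * k_j / 2m, used here as a stand-in."""
    k = [sum(row) for row in adj]
    two_m = sum(k)
    return [[ki * kj / two_m for kj in k] for ki in k]

# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

q_two = modularity(adj, ng_null(adj), [0, 0, 0, 1, 1, 1])  # two triangles
q_one = modularity(adj, ng_null(adj), [0] * 6)             # one community
```

Passing a distance-based null matrix instead of `ng_null(adj)` gives the corresponding dist-modularity for the same partition.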



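Note that there is a range parameter $\sigma \in (0, +\infty)$ in $P_{ij}^{dist}$, hence different values of $\sigma$ give different dist-modularities. In one extreme case, i.e., $\sigma \to 0^+$, the range of the field driven by each node is so short that the potential exists only at the source node. This gives

$$\lim_{\sigma \to 0^+} P_{ij}^{dist} = \begin{cases} k_i, & \text{if } i = j; \\ 0, & \text{otherwise,} \end{cases} \tag{12}$$

and hence

$$\lim_{\sigma \to 0^+} Q^{dist}(\ell, \sigma) = \frac{1}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( A_{ij} - k_i \delta(i, j) \right) \delta(\ell_i, \ell_j). \tag{13}$$

As a result, the Laplacian matrix $L$, the key matrix in graph partitioning and spectral clustering, and the modularity matrix $B$, defined to be $A - P$, are unified as $\sigma$ approaches $0^+$: since $L = D - A$, with $D$ the diagonal matrix of node degrees, the limit of $B$ is $A - D = -L$. The only difference is the sign.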

In the other extreme case, i.e., $\sigma \to +\infty$, the range of the field driven by each node is so long that its potential is the same at every node. This gives

$$\lim_{\sigma \to +\infty} P_{ij}^{dist} = \frac{k_i k_j}{2m}. \tag{14}$$

As a result, dist-modularity reduces to NG modularity.
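Both extremes can be checked numerically. The sketch below is a hedged illustration, assuming the distance function $e^{-(d/\sigma)^2}$, a symmetrized null model, and strictly positive distances between distinct nodes; the function name `p_dist` and the toy path graph are hypothetical. A very small and a very large $\sigma$ stand in for the two limits.

```python
import math

def p_dist(degrees, dist, sigma):
    """Sketch of the symmetrized null model with f(d) = exp(-(d/sigma)^2)."""
    n = len(degrees)
    w = [[math.exp(-((dist[i][j] / sigma) ** 2)) for j in range(n)]
         for i in range(n)]
    # denom[i] = sum_t k_t * f(d_ti)
    denom = [sum(degrees[t] * w[t][i] for t in range(n)) for i in range(n)]
    pt = [[degrees[i] * degrees[j] * w[i][j] / denom[i] for j in range(n)]
          for i in range(n)]
    return [[0.5 * (pt[i][j] + pt[j][i]) for j in range(n)] for i in range(n)]

degrees = [1, 2, 1]                       # path graph v0 - v1 - v2
dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]  # hop-count distances
two_m = sum(degrees)

short_range = p_dist(degrees, dist, sigma=1e-3)  # approximates sigma -> 0+
long_range = p_dist(degrees, dist, sigma=1e6)    # approximates sigma -> +inf
```

With these inputs, `short_range` is numerically the diagonal matrix of degrees, and `long_range` agrees with $k_i k_j / 2m$ to within floating-point tolerance.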

It is interesting to note that as $\sigma$ increases from $0$ to $+\infty$, optimizing dist-modularity brings community structure at different scales, from coarse to fine. First, $\lim_{\sigma \to 0^+} Q^{dist}$ can be rewritten as

$$\lim_{\sigma \to 0^+} Q^{dist} = -\frac{1}{2m} \sum_{g=1}^{c} \sum_{i \in C_g} \sum_{j \notin C_g} A_{ij}, \tag{15}$$

where $c$ is the number of communities and $C_g$ is the $g$th community. Eq. (15) implies that $\lim_{\sigma \to 0^+} Q^{dist}$ attains its optimum, $0$, when the network is divided into only one community or into several communities corresponding to its connected components. Obviously this is the community structure at the coarse scale. Second, optimizing $\lim_{\sigma \to +\infty} Q^{dist}$, i.e., optimizing NG modularity, results in community structure at the fine scale. Third, as $\sigma$ ranges from $0$ to $+\infty$, optimizing $Q^{dist}$ brings multi-scale community structures that fall between the above two extremes.
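The coarse-scale behavior can be seen on a small example. The sketch below evaluates the $\sigma \to 0^+$ limit directly from the limiting null model of Eq. (12), in which $P_{ij}$ equals $k_i$ on the diagonal and $0$ elsewhere. The function name `q_limit_zero` and the toy graph are hypothetical.

```python
def q_limit_zero(adj, labels):
    """Sketch of lim Q as sigma -> 0+: null model k_i on the diagonal, else 0."""
    n = len(adj)
    k = [sum(row) for row in adj]
    two_m = sum(k)
    total = 0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                total += adj[i][j] - (k[i] if i == j else 0)
    return total / two_m

# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

q_one = q_limit_zero(adj, [0] * 6)               # whole connected graph
q_two = q_limit_zero(adj, [0, 0, 0, 1, 1, 1])    # split cuts the edge 2-3
```

Keeping the connected graph as a single community gives the optimum value of zero, whereas any split that cuts an edge is penalized in proportion to the number of cut edges.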

One issue that has not been addressed is how to obtain the distance between nodes. For plain networks with only link information on nodes, $d_{ij}$ can be calculated from the $i$th and $j$th rows of the adjacency matrix $A$. Many networks in the real world have additional information on nodes (node attributes), such as the geographical position of a location or the profile of a person. For these networks, $d_{ij}$ can be calculated from the attributes of $v_i$ and $v_j$ using a specific vector distance measure, such as the Euclidean distance, the Manhattan distance, or the Minkowski distance.
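A minimal sketch of these distance computations follows. The helper names, the example coordinates, and the choice of the Manhattan distance between adjacency rows are assumptions made for illustration; the text does not prescribe a particular measure for link-only networks.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p >= 1 between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def euclidean(x, y):
    return minkowski(x, y, 2)  # Minkowski distance with p = 2

def manhattan(x, y):
    return minkowski(x, y, 1)  # Minkowski distance with p = 1

def distance_matrix(vectors, metric):
    """Pairwise distances d_ij between the rows of `vectors`."""
    n = len(vectors)
    return [[metric(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]

# Attribute-based distances: hypothetical 2-D positions of three nodes.
positions = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d_attr = distance_matrix(positions, euclidean)

# Link-only distances: compare rows of the adjacency matrix of a path graph.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
d_rows = distance_matrix(adj, manhattan)
```

Note that row-based distances can be zero for distinct nodes with identical neighborhoods, as happens for the two end nodes of the path graph here.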

So far we have proposed a new null model by incorporating a distance function. In the following, we show how this null model can be generalized to produce a family of dist-modularities for various networks. Now let us look back at Eqs. (5) and (6), the expressions of the potentials, which are the core component of our null model. There, the power of the field source and the distance function are specified as the node degree and $e^{-(d/\sigma)^2}$, respectively. However, we have many more choices while preserving the desired features of the null model. Suppose the power of $v_i$ and the distance function are denoted by $N_i$ and $f(d)$, respectively. Then

$$\varphi_i(j) = N_i f(d_{ij}), \tag{16}$$

$$\varphi_{amp}(j) = \sum_{t=1}^{n} N_t f(d_{tj}). \tag{17}$$
Substituting Eqs. (16) and (17) into Eq. (7), we have

$$\tilde{P}_{ij} = \frac{N_i N_j f(d_{ij})}{\sum_{t=1}^{n} N_t f(d_{ti})}. \tag{18}$$

Finally, with Eqs. (3) and (18) we can derive that

$$\sum_{i=1}^{n} \sum_{j=1}^{n} P_{ij}^{dist} = \sum_{i=1}^{n} N_i. \tag{19}$$

Eq. (19) implies that the number of edges of the observed network is preserved if $\sum_{i=1}^{n} N_i = 2m$ is satisfied. This can be easily achieved by normalizing $N_i$ as

$$N_i \leftarrow \frac{N_i}{\sum_{t=1}^{n} N_t} \, 2m. \tag{20}$$
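The generalized construction and the normalization step can be sketched as follows. This is an illustrative example under stated assumptions: the null model is the symmetrized average of $\tilde{P}_{ij}$ and $\tilde{P}_{ji}$, the distance matrix is symmetric, and the node powers come from an invented attribute. The function names and the sample values are hypothetical.

```python
def normalize_power(power, two_m):
    """Rescale node powers N_i so that they sum to 2m."""
    s = sum(power)
    return [two_m * x / s for x in power]

def general_null(power, dist, f):
    """Sketch: P_tilde_ij = N_i N_j f(d_ij) / sum_t N_t f(d_ti), symmetrized."""
    n = len(power)
    denom = [sum(power[t] * f(dist[t][i]) for t in range(n)) for i in range(n)]
    pt = [[power[i] * power[j] * f(dist[i][j]) / denom[i] for j in range(n)]
          for i in range(n)]
    return [[0.5 * (pt[i][j] + pt[j][i]) for j in range(n)] for i in range(n)]

two_m = 4                            # path graph v0 - v1 - v2 has m = 2
raw_power = [10.0, 30.0, 60.0]       # hypothetical node attribute
power = normalize_power(raw_power, two_m)
dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]

# Distance function 1 / (1 + d), i.e. a rational decay with unit parameters.
null = general_null(power, dist, lambda d: 1.0 / (1.0 + d))
total = sum(map(sum, null))
```

After normalization the entries of `null` sum to $2m$, so the expected number of edges matches that of the observed network.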

Due to the above developments, there is considerable freedom in specifying $N_i$ and $f(d)$. For example, $N_i$ can be equally distributed among all nodes, i.e., $N_i = 2m/n$. In networks with node attributes, $N_i$ can be specified as the most representative attribute, or as a comprehensive index obtained from several attributes. As for $f(d)$, the common choices are

$$f(d) = e^{-(d/\sigma)^r}, \tag{21}$$

$$f(d) = \frac{1}{1 + (d/\sigma)^r}, \tag{22}$$

f(d) = 1= (23)

$$f(d) = \begin{cases} 1, & \text{if } d < \sigma; \\ 0, & \text{otherwise,} \end{cases} \tag{24}$$



where $\sigma \in (0, +\infty)$ and $r > 0$ are parameters. Note that functions (21)-(24) are monotonically decreasing on the domain $d \in [0, +\infty)$. However, this is not a necessary constraint. For example, $f(d)$ can be learned from the observed network as



f(d) = ( E 



(25) 



with a binning procedure to smooth the function. $f(d)$ can even be replaced by a similarity function that monotonically increases with $d$.
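The closed-form choices (21), (22), and (24) can be written as small functions. The sketch below is illustrative; the function names are hypothetical, and the strict inequality in the step function follows the form given in Eq. (24).

```python
import math

def exp_decay(d, sigma, r):
    """Eq. (21): f(d) = exp(-(d / sigma)^r)."""
    return math.exp(-((d / sigma) ** r))

def rational_decay(d, sigma, r):
    """Eq. (22): f(d) = 1 / (1 + (d / sigma)^r)."""
    return 1.0 / (1.0 + (d / sigma) ** r)

def step(d, sigma):
    """Eq. (24): f(d) = 1 if d < sigma, else 0."""
    return 1.0 if d < sigma else 0.0

# Sample the three functions on a few distances with sigma = 2 and r = 2.
distances = [0.0, 1.0, 2.0, 4.0]
samples = {
    "exp": [exp_decay(d, 2.0, 2.0) for d in distances],
    "rational": [rational_decay(d, 2.0, 2.0) for d in distances],
    "step": [step(d, 2.0) for d in distances],
}
```

Each function equals 1 at $d = 0$ and does not increase with $d$; $\sigma$ sets the distance scale at which the decay becomes noticeable.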

The freedom in specifying $N_i$ and $f(d)$ enables us to create a framework that produces a family of dist-modularities. Within this framework, it is interesting to see some relations with previous work. For example: 1) when $N_i = 2m/n$ and $f(d) = 1$, we have $P_{ij}^{dist} = 2m/n^2$, indicating the Erdős-Rényi random graph [13]; 2) when $N_i = k_i$ and $f(d) = 1$, we have $P_{ij}^{dist} = k_i k_j / 2m$, indicating the null model of NG modularity.
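Both special cases follow from one substitution. The short derivation below is a sketch that assumes the generalized form $\tilde{P}_{ij} = N_i N_j f(d_{ij}) / \sum_t N_t f(d_{ti})$ and the normalization $\sum_t N_t = 2m$.

```latex
\begin{align*}
f(d) = 1 \;&\Rightarrow\;
  \tilde{P}_{ij} = \frac{N_i N_j}{\sum_{t=1}^{n} N_t} = \frac{N_i N_j}{2m}
  = \tilde{P}_{ji} = P^{dist}_{ij}, \\
N_i = \frac{2m}{n} \;&\Rightarrow\;
  P^{dist}_{ij} = \frac{(2m/n)^2}{2m} = \frac{2m}{n^2}, \\
N_i = k_i \;&\Rightarrow\;
  P^{dist}_{ij} = \frac{k_i k_j}{2m}.
\end{align*}
```

With a constant distance function the null model is already symmetric, so the symmetrization step leaves it unchanged.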

In conclusion, we incorporate distance functions in the null model to capture the features of real-world networks. Taking this null model as a reference for comparing the fraction of within-community edges with the observed network, we create a framework for generating a family of dist-modularities adapted to various networks, including networks with node attributes. In addition, we have several interesting findings within this framework:

• The Laplacian matrix and the modularity matrix can be unified, providing an in-depth view of the close relationship between graph partitioning/spectral clustering and community detection;

• NG modularity can be exactly recovered as a special case of dist-modularity;

• Dist-modularity can be used to detect communities at different scales, from coarse to fine.



RELATED WORK 

The study of community detection in networks has a long history. It is closely related to graph partitioning in computer science and to hierarchical clustering [14] in sociology. In the past decade, this study has attracted a great deal of interest and various methods have been proposed. In particular, modularity optimization is widely used despite its intrinsic limits [27].

Recently, some researchers have considered community detection in networks with node attributes. Expert et al. and Cerina et al. proposed methods that factor out the effect of space in spatial networks where geographical information on nodes is available. Yang et al. devised a discriminative approach for combining the edge and node attributes [31].